In these exercises, we will go through the dataset you scraped and preprocess it for further analysis. If you couldn’t scrape your own dataset for some reason, you can use the one we provide: RawComments.Rds. Should you run into any problems, try the clue boxes or ask us for help. The solutions use the prepared dataset; if you are using your own scraped data, be sure to replace the names and paths in the code accordingly.

Exercise 1

Load your scraped dataset or the one we prepared for you into your R-session and assign it to a dataframe called comments. Get an overview of the contained variables. What do the variables describe? Why do we have missing data in some of them?

To load the data, you can use the readRDS() function. To get an overview of the contained variables, you can simply use colnames(). To find out more about what the variables mean, you can go to the YouTube Data API documentation and search for the description of the comments output.

# Loading dataset
comments <- readRDS("../data/RawComments.rds")

# overview of columns
colnames(comments)
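Beyond colnames(), it can help to look at the class of each column and to count the missing values per column. The following is a minimal sketch on a made-up toy dataframe; the values are invented for illustration and only mimic the structure of the real data.

```r
# Toy stand-in for the comments dataframe (all values are invented)
toy <- data.frame(
  textOriginal = c("First!", "Nice video", "I agree"),
  likeCount    = c("3", "0", "1"),
  parentId     = c(NA, NA, "abc123"),
  stringsAsFactors = FALSE
)

# Compact overview: one line per variable with its class and first values
str(toy)

# Number of missing values per column
na_counts <- colSums(is.na(toy))
na_counts
```

On your own data, replace toy with comments. Columns with a high count of missing values are good candidates to look up in the API documentation.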

Exercise 2

We want to remove the variables authorProfileImageUrl, authorChannelUrl, authorChannelId.value, videoId, canRate, viewerRating and moderationStatus. Create a new dataframe called Selection containing only the remaining variables.

You can use the subset() function to keep or remove a selection of variables from a dataframe. For more information, run ?subset()

# selecting only the columns we need
Selection <- subset(comments,select = -c(authorProfileImageUrl,
                                         authorChannelUrl,
                                         authorChannelId.value,
                                         videoId,
                                         canRate,
                                         viewerRating,
                                         moderationStatus))
# Checking Selection
colnames(Selection)
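If your own scraped dataset lacks one of these columns, subset() with unquoted names stops with an "object not found" error. A possible alternative, sketched here on an invented toy dataframe, is to work with the column names as character strings and keep everything that is not on the drop list.

```r
# Toy stand-in for the comments dataframe (all values are invented)
toy <- data.frame(
  id           = 1:2,
  canRate      = c(TRUE, TRUE),
  textOriginal = c("a", "b"),
  stringsAsFactors = FALSE
)

# Columns we want to drop; viewerRating is deliberately absent from toy
drop_cols <- c("canRate", "viewerRating")

# Keep every column whose name is not on the drop list
keep_cols <- setdiff(names(toy), drop_cols)
toy_selection <- toy[, keep_cols, drop = FALSE]

names(toy_selection)
```

Because setdiff() simply ignores names that do not occur, the same code runs whether or not every column on the drop list exists.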

Exercise 3

Check the class of the variable publishedAt in your new dataframe. Is this class suitable for further analysis? If not, change the class to the appropriate one and compute the time difference in publishing dates between the comment in the first row and the comment in the last row.

Do the same transformation for the variable updatedAt

To check the class of the publishedAt variable, you can use the class() function. To check the formatting of the comment timestamp, you can consult the YouTube API documentation. To transform character strings into datetime objects in R, you can use the base function as.POSIXct() or the more convenient anytime() function from the package of the same name.

# Checking class
class(Selection$publishedAt)

# transforming to datetime object
library(anytime)
Selection$publishedAt <- anytime(Selection$publishedAt,asUTC = TRUE)
class(Selection$publishedAt)

# computing time difference in publishing time
Selection$publishedAt[1] - Selection$publishedAt[dim(Selection)[1]]

# Transforming the updatedAt variable as well
Selection$updatedAt <- anytime(Selection$updatedAt,asUTC = TRUE)
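If you prefer the base function mentioned in the hint, the conversion could look roughly like the sketch below. The two timestamps are invented, and the format string assumes timestamps of the form "2020-02-10T18:30:00Z"; if your data contains fractional seconds or a different layout, the format needs to be adapted.

```r
# Invented example timestamps in an ISO 8601 layout (an assumption about the data)
stamps <- c("2020-02-10T18:30:00Z", "2020-02-08T06:00:00Z")

# Parse the character strings as UTC datetimes
parsed <- as.POSIXct(stamps, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
class(parsed)

# Time difference between first and last element with explicit units
gap <- difftime(parsed[1], parsed[length(parsed)], units = "hours")
gap
```

Subtracting two datetimes with - lets R pick the unit automatically, whereas difftime() lets you fix the unit yourself, which makes results easier to compare across datasets.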

Exercise 4

Check the likeCount variable in your data. Is it suitable for numeric analysis? If not, transform it to the appropriate class and test whether your transformation worked.

You can use the class() function to check the class of an R-object. To change a class, for example from character to numeric, you can use the family of “as”-functions, for example as.numeric()

# Checking class
class(Selection$likeCount)
## [1] "character"
# Transforming class
Selection$likeCount <- as.numeric(Selection$likeCount)

# rechecking class
class(Selection$likeCount)
## [1] "numeric"
summary(Selection$likeCount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   12.29    1.00 3903.00
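One way to test whether the transformation worked is to look for values that became NA only through the conversion. The sketch below uses an invented character vector with one deliberately broken entry.

```r
# Invented like counts stored as text; "n/a" cannot be converted
raw_counts <- c("0", "12", "3903", "n/a")

# as.numeric() warns and returns NA for entries it cannot convert
num_counts <- suppressWarnings(as.numeric(raw_counts))

# Entries that were present before but are NA after the conversion
failed <- is.na(num_counts) & !is.na(raw_counts)
raw_counts[failed]
```

If raw_counts[failed] is empty for your data, every non-missing entry was converted successfully.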

Exercise 5

Check the textOriginal column in your Selection dataframe. There are still hyperlinks in the column that we should remove for later text analysis steps. Extract the hyperlinks from the textOriginal column into a new list called Links. In addition, create a new variable called LinksDel that contains the textOriginal but without the hyperlinks.

The qdapRegex package has many prebuilt functions to detect, remove and replace specific character strings. For example, you can use the rm_url() function to extract and replace hyperlinks. You can check its documentation with ?rm_url() to learn how to extract and how to replace hyperlinks.

# package
library(qdapRegex)

# Checking column
View(Selection$textOriginal)

# extracting hyperlinks
Links <- rm_url(Selection$textOriginal, extract = TRUE)
head(Links,10)
## [[1]]
## [1] NA
## 
## [[2]]
## [1] "https://youtu.be/GWCySrYxov0"
## 
## [[3]]
## [1] NA
## 
## [[4]]
## [1] NA
## 
## [[5]]
## [1] NA
## 
## [[6]]
## [1] NA
## 
## [[7]]
## [1] NA
## 
## [[8]]
## [1] NA
## 
## [[9]]
## [1] NA
## 
## [[10]]
## [1] NA
# removing hyperlinks
LinksDel <- rm_url(Selection$textOriginal)
head(LinksDel,10)
##  [1] "The only people who don't want to answer a citizenship question shouldn't be here."                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
##  [2] "The US Census put out a music video to This is Me to encourage Native Hawaiian and Pacific Islander artists to take part in the census."                                                                                                                                                                                                                                                                                                                                                                                                                  
##  [3] "What is the recent census of population. After recent death, suicide, over dose, storm. American government said they have good news that over dose deaths decreased and the life inspecticy increased. How! When a large number of deaths happen it will decline the numbers meaning half of population is dead... So the recent number would decreased. Again what is the recent number of deaths wow that's a lot it's just February and that's alot. Again how did they die again... Where did they get it... Are you sure it's not a housing crisis."
##  [4] "It's 2020. Still waiting for the census. No mail, no texts, etc. Though if they called, they may have been blocked due to my sister declaring bankruptcy and all the debt collectors calling because i share the last name..."                                                                                                                                                                                                                                                                                                                            
##  [5] "\"do you want me to show you stories of census takers breaking to peoples houses and killing or raping them\" yes please good sir, please show me more than 2 examples of that"                                                                                                                                                                                                                                                                                                                                                                           
##  [6] "He probably wanted to remove the census because he read it costs 15 billion and he’d rather have that as a tax break then be given to.. ugh.. normal people."                                                                                                                                                                                                                                                                                                                                                                                             
##  [7] "“There’s a lot to unpack here” proceeds to unpack one thing that he didn’t say and just makes a stupid joke out of it."                                                                                                                                                                                                                                                                                                                                                                                                                                   
##  [8] "... 0:39 Is that the Crying Indian in the back?"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
##  [9] "6:00 Not just Ohio, New York also lost 2 seats"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
## [10] "The politicians get to pick their voters with gerrymandering too."
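To see what extracting and removing hyperlinks means mechanically, here is a rough base R sketch. It uses a deliberately simplified URL pattern (an assumption for illustration, not the pattern that rm_url() uses) and invented example texts.

```r
# Invented example comments
texts <- c("no link here",
           "watch https://a.example/video now",
           "two: http://a.example/x and https://b.example/y")

# Simplified pattern: "http://" or "https://" followed by non-space characters
url_pattern <- "https?://[^[:space:]]+"

# Extract all matches per comment (a list with one element per comment)
links <- regmatches(texts, gregexpr(url_pattern, texts))

# Remove the matches, then collapse leftover whitespace
no_links <- trimws(gsub("[[:space:]]+", " ", gsub(url_pattern, "", texts)))

links
no_links
```

Note that this sketch returns an empty character vector for comments without links, whereas the rm_url() output shown above contains NA in that case.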

Exercise 6

Check the LinksDel variable to see that there are still emoji contained in the column. For our later analysis, we want to do three things:

  1. Create one column without hyperlinks and emoji for easier text mining
  2. Create one column where emoji are replaced by a textual description for easier text mining
  3. Create one column containing only the textual description of emoji

To achieve this, we first need a dictionary of emoji and their corresponding textual descriptions in a usable format. Load the emo package and have a look at the contained dataframe jis. Copy it to a new dataframe called EmojiList. Afterwards, source the provided CamelCase.R script (in the scripts folder) to transform the textual description from regular case into CamelCase. Finally, create a new variable called TextEmoDel containing the text without the emoji (you can use the ji_replace_all() function from the emo package for that)

We provide you with a function that capitalizes the first character of each word. The function is called simpleCap() and the script’s name is CamelCase.R. You can load it into your workspace using the source() function and specifying its location. You can find the script in the scripts folder. Keep in mind that this function only capitalizes the first letter of each word; you still need to get rid of the extra space characters. The gsub() function is a handy tool for this.

# loading package
library(emo)

# sourcing script
source("../scripts/CamelCase.R")

# Reassigning dataframe
EmojiList <- jis

# Applying the function to all the names
CamelCaseEmojis <- lapply(jis$name, simpleCap)

# Deleting the empty spaces
CollapsedEmojis <- lapply(CamelCaseEmojis,function(x){gsub(" ", "", x, fixed = TRUE)})

# Formatting back from a list to a vector
EmojiList[,4] <- unlist(CollapsedEmojis)

# Overview of first 10 rows
EmojiList[1:10,c(1,3,4)]
##    runes emoji                        name
## 1  1F600     😀                GrinningFace
## 2  1F601     😁  BeamingFaceWithSmilingEyes
## 3  1F602     😂          FaceWithTearsOfJoy
## 4  1F923     🤣   RollingOnTheFloorLaughing
## 5  1F603     😃     GrinningFaceWithBigEyes
## 6  1F604     😄 GrinningFaceWithSmilingEyes
## 7  1F605     😅       GrinningFaceWithSweat
## 8  1F606     😆       GrinningSquintingFace
## 9  1F609     😉                 WinkingFace
## 10 1F60A     😊  SmilingFaceWithSmilingEyes
# Creating text column with removed Emoji (and hyperlinks)
TextEmoDel <- ji_replace_all(LinksDel,"")
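If you are curious how such a CamelCase conversion can be written, the following is a minimal sketch. The helper to_camel_case() is hypothetical and written only for this illustration; the simpleCap() function in the provided CamelCase.R script may work differently.

```r
# Hypothetical helper (not the provided simpleCap()): capitalizes the first
# letter of every word and glues the words together without spaces
to_camel_case <- function(x) {
  vapply(strsplit(x, " ", fixed = TRUE), function(words) {
    paste0(toupper(substring(words, 1, 1)), substring(words, 2), collapse = "")
  }, character(1))
}

camel <- to_camel_case(c("grinning face", "face with tears of joy"))
camel
```

Unlike the two-step solution above, this sketch capitalizes and removes the spaces in a single pass.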

Exercise 7

Ultimately, we want to use our EmojiList dataframe to replace the instances of emoji in our text with the textual description. We can do that by looping through all emoji in all texts and replacing them one at a time. There is a problem, however: some emoji are made up of multiple “shorter” emoji. If we match part of a “longer” emoji and replace it with its text description, the rest will become unreadable. For this reason, we need to make sure that we replace the emoji from longest to shortest. Sort the EmojiList dataframe by the length of the emoji column from longest to shortest.

You can count the number of characters in a vector of text using the nchar() function. You can reorder dataframes using the order function and you can reverse an order using the rev() function.

# ordering from longest to shortest
EmojiList <- EmojiList[rev(order(nchar(EmojiList$emoji))),]

# Overview of new order
head(EmojiList[,c(1,3,4)],5)
##                                           runes emoji             name
## 1862 1F469 200D 2764 FE0F 200D 1F48B 200D 1F469  👩‍❤️‍💋‍👩 Kiss:Woman,Woman
## 1860 1F468 200D 2764 FE0F 200D 1F48B 200D 1F468  👨‍❤️‍💋‍👨     Kiss:Man,Man
## 1858 1F469 200D 2764 FE0F 200D 1F48B 200D 1F468  👩‍❤️‍💋‍👨   Kiss:Woman,Man
## 3570  1F3F4 E0067 E0062 E0077 E006C E0073 E007F     🏴󠁧󠁢󠁷󠁬󠁳󠁿            Wales
## 3569  1F3F4 E0067 E0062 E0073 E0063 E0074 E007F     🏴󠁧󠁢󠁳󠁣󠁴󠁿         Scotland
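The sorting step can be tried out on a small toy dictionary first. The sketch below uses plain-text smileys as stand-ins for emoji (an assumption made so that the example does not depend on your locale or font) and sorts with the decreasing argument of order() instead of rev().

```r
# Toy dictionary with text smileys standing in for emoji
toy_dict <- data.frame(
  pattern = c(":)", ":))", ":-))"),
  name    = c("Smile", "BigSmile", "HugeSmile"),
  stringsAsFactors = FALSE
)

# Sort rows by the number of characters in the pattern, longest first
toy_dict <- toy_dict[order(nchar(toy_dict$pattern), decreasing = TRUE), ]
toy_dict$pattern
```

The two variants can order patterns of equal length differently. That does not matter for the replacement, because two different patterns of the same length cannot contain each other.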

Exercise 8

We now have a working dictionary for replacing emoji with a textual description! Create a new variable called TextEmoRep as a copy of the LinksDel variable. Next, loop through the ordered EmojiList and for every element in TextEmoRep, replace the contained emoji with “EMOJI_” followed by their textual description. You can use the rm_default() function from the qdapRegex package to replace custom patterns. Be sure to check the documentation so you can set the appropriate options for the function.

Beware: There will be warnings in your console even if you are doing everything right.

Loop through the dictionary sorted from longest to shortest emoji. You need to use a for loop to go through all emoji for all comments, one at a time. The paste() function is useful for adding the prefix “EMOJI_” in front of your textual descriptions. Don’t forget to set the arguments fixed = TRUE, clean = FALSE and trim = FALSE in your call to rm_default().

# Assigning the column to a new variable
TextEmoRep <- LinksDel

# switching off warnings
options(warn=-1)

# Looping through all emoji for all comments in TextEmoRep
for (i in 1:dim(EmojiList)[1]) {

  TextEmoRep <- rm_default(TextEmoRep,
                    pattern = EmojiList[i,3],
                    replacement = paste0("EMOJI_",
                                       EmojiList[i,4],
                                       " "),
                    fixed = TRUE,
                    clean = FALSE,
                    trim = FALSE)
}

# checking results
LinksDel[159:171]
##  [1] "A government census website would probably run as well as the VAs 🙄"                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
##  [2] "3:35 wtf? 😂😂😂"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
##  [3] "Now i know why I don't watch this show It just making fun of Republicans"                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
##  [4] "Why you picking on me. It may be slow, but it is chisled and refined."                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
##  [5] "Aww, look how innocent we were back in 1980.😔"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
##  [6] "Illegal immigrants don't count."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
##  [7] "Glenn Yarborough. Geez."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
##  [8] "Four more years! Trump 2020"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
##  [9] "Has anyone else noticed John Oliver looks like if the hypothetical lovechild of Michael Cera and discount Ricky Gervais fucked a green bean"                                                                                                                                                                                                                                                                                                                                                                                                  
## [10] "hopefully it will show how big of a dump Chicago is and how many people fled city and the state. Liberal utopia is gonna lose some seats guaranteed."                                                                                                                                                                                                                                                                                                                                                                                         
## [11] "The census questions literally asks how many toilets you have. Do some research. 1/10th of the cencus takers get the 'long form' with these questions, which if not answered can result in jail time or fines. They also ask how much you pay in rent and utilities, what your level of education is, what your college degree was in, an enumeration of mental illnesses, and so on. They sell this information to non-governmental entities (with the name stripped, however, I'll guarantee google/facebook can piece that back together)."
## [12] "The next question in the US census will be How white are you to be considered American"                                                                                                                                                                                                                                                                                                                                                                                                                                                       
## [13] "Even if filling it out indicates wether people are citizens or not & is part of a plot against minorities 🙄, there should be detailed information of who is a citizen & who is not I believe the country deserves as much accurate information it can get, after all aren’t we the leading nation in the world?"
TextEmoRep[159:171]
##  [1] "A government census website would probably run as well as the VAs EMOJI_FaceWithRollingEyes "                                                                                                                                                                                                                                                                                                                                                                                                                                                 
##  [2] "3:35 wtf? EMOJI_FaceWithTearsOfJoy EMOJI_FaceWithTearsOfJoy EMOJI_FaceWithTearsOfJoy "                                                                                                                                                                                                                                                                                                                                                                                                                                                        
##  [3] "Now i know why I don't watch this show It just making fun of Republicans"                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
##  [4] "Why you picking on me. It may be slow, but it is chisled and refined."                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
##  [5] "Aww, look how innocent we were back in 1980.EMOJI_PensiveFace "                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
##  [6] "Illegal immigrants don't count."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
##  [7] "Glenn Yarborough. Geez."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
##  [8] "Four more years! Trump 2020"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
##  [9] "Has anyone else noticed John Oliver looks like if the hypothetical lovechild of Michael Cera and discount Ricky Gervais fucked a green bean"                                                                                                                                                                                                                                                                                                                                                                                                  
## [10] "hopefully it will show how big of a dump Chicago is and how many people fled city and the state. Liberal utopia is gonna lose some seats guaranteed."                                                                                                                                                                                                                                                                                                                                                                                         
## [11] "The census questions literally asks how many toilets you have. Do some research. 1/10th of the cencus takers get the 'long form' with these questions, which if not answered can result in jail time or fines. They also ask how much you pay in rent and utilities, what your level of education is, what your college degree was in, an enumeration of mental illnesses, and so on. They sell this information to non-governmental entities (with the name stripped, however, I'll guarantee google/facebook can piece that back together)."
## [12] "The next question in the US census will be How white are you to be considered American"                                                                                                                                                                                                                                                                                                                                                                                                                                                       
## [13] "Even if filling it out indicates wether people are citizens or not & is part of a plot against minorities EMOJI_FaceWithRollingEyes , there should be detailed information of who is a citizen & who is not I believe the country deserves as much accurate information it can get, after all aren’t we the leading nation in the world?"
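To see why the replacement order matters, you can run the same kind of loop on a toy example. The sketch below uses base gsub() instead of rm_default() and text smileys as stand-ins for emoji; the helper replace_all() and the toy dictionary are made up for this illustration.

```r
# Hypothetical helper: replaces each dictionary pattern in turn, top to bottom
replace_all <- function(text, dict) {
  for (i in seq_len(nrow(dict))) {
    text <- gsub(dict$pattern[i],
                 paste0("EMOJI_", dict$name[i], " "),
                 text,
                 fixed = TRUE)
  }
  text
}

# Toy dictionary, already ordered from longest to shortest pattern
dict <- data.frame(pattern = c(":))", ":)"),
                   name    = c("BigSmile", "Smile"),
                   stringsAsFactors = FALSE)

comment <- "nice :)) ok :)"

# Longest first: both smileys are replaced completely
good <- replace_all(comment, dict)

# Shortest first: ":)" matches inside ":))" and leaves a stray ")" behind
bad <- replace_all(comment, dict[rev(seq_len(nrow(dict))), ])

good
bad
```

The stray character in bad is the toy equivalent of the unreadable remainder of a multi-part emoji.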

Exercise 9

We now have the original text column and a text column in which hyperlinks are removed and emoji are replaced with their textual descriptions (TextEmoRep). We need one more variable that contains only the textual descriptions of the emoji. You can use our predefined function ExtractEmoji() from the scripts folder to create this variable.

Use the source() function to source the ExtractEmoji.R script from the scripts folder and then sapply() the ExtractEmoji() function to the variable TextEmoRep. To remove the unneeded names from the extracted emoji, you can set names(Emoji) to NULL.

# sourcing function
source("../scripts/ExtractEmoji.R")

# Using function
Emoji <- sapply(TextEmoRep,ExtractEmoji)
names(Emoji) <- NULL

# checking results
TextEmoRep[39]
## [1] "More bullshit EMOJI_FaceWithMedicalMask EMOJI_NauseatedFace "
Emoji[39]
## [1] "EMOJI_FaceWithMedicalMask EMOJI_NauseatedFace "
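For illustration, an extraction of this kind could be written with base R regular expressions as sketched below. The helper extract_emoji_tokens() is hypothetical and is not the provided ExtractEmoji() function, which may behave differently (for example, this sketch does not keep the trailing space seen in the output above).

```r
# Hypothetical helper: keeps only the "EMOJI_..." tokens of each text
extract_emoji_tokens <- function(x) {
  # A token is "EMOJI_" followed by any run of non-space characters
  tokens <- regmatches(x, gregexpr("EMOJI_[^[:space:]]+", x))
  # Join the tokens of each text with single spaces
  vapply(tokens, paste, character(1), collapse = " ")
}

toy_text <- c("Great video EMOJI_GrinningFace EMOJI_WinkingFace ",
              "no emoji here")
toy_emoji <- extract_emoji_tokens(toy_text)
toy_emoji
```

Texts without any emoji description yield an empty string in this sketch.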

Exercise 10

We now have selected all the variables we need, formatted them into the right formats, cleaned the text and extracted some additional information from it. Create a new dataframe called df that contains the following variables:

  • Selection$authorDisplayName

  • Selection$textOriginal

  • TextEmoRep

  • TextEmoDel

  • Emoji

  • Selection$likeCount

  • Links

  • Selection$publishedAt

  • Selection$updatedAt

  • Selection$parentId

  • Selection$id

Set the following names for the columns in the new dataframe:

  • Author

  • Text

  • TextEmojiReplaced

  • TextEmojiDeleted

  • Emoji

  • LikeCount

  • URL

  • Published

  • Updated

  • ParentId

  • CommentID

Save the new dataframe as an RDS object with the name “ParsedComments.Rds”

You can use the cbind.data.frame() function to paste together multiple columns to a dataframe. You need to set the argument stringsAsFactors = FALSE though, to prevent strings from being interpreted as factor variables. In addition, the variables Links and Emoji are lists and can contain multiple values per row. For this reason, we need to enclose them with the I() function to be able to put them into a dataframe. You can save your result using the saveRDS() function.

# creating df dataframe (use I() function to enclose Emoji and Links)
df <- cbind.data.frame(Selection$authorDisplayName,
                       Selection$textOriginal,
                       TextEmoRep,
                       TextEmoDel,
                       I(Emoji),
                       Selection$likeCount,
                       I(Links),
                       Selection$publishedAt,
                       Selection$updatedAt,
                       Selection$parentId,
                       Selection$id,
                       stringsAsFactors = FALSE)

# setting column names
names(df) <- c("Author",
               "Text",
               "TextEmojiReplaced",
               "TextEmojiDeleted",
               "Emoji",
               "LikeCount",
               "URL",
               "Published",
               "Updated",
               "ParentId",
               "CommentID")

# deleting row names
row.names(df) <- NULL

# saving dataframe
saveRDS(df, file = "../data/ParsedComments.rds")
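To check that list columns wrapped in I() survive saving and loading, you can try a round trip on a small example. The sketch below uses an invented two-row dataframe and a temporary file, so it does not touch your data folder.

```r
# Invented list with a different number of links per row
toy_links <- list(NA, c("https://a.example", "https://b.example"))

# I() keeps the list as a single column instead of spreading it out
toy_df <- data.frame(Text = c("first", "second"),
                     URL  = I(toy_links),
                     stringsAsFactors = FALSE)

# Save to a temporary file and read it back
tmp <- tempfile(fileext = ".rds")
saveRDS(toy_df, file = tmp)
restored <- readRDS(tmp)

# TRUE if nothing was lost or changed on the way
identical(toy_df, restored)
```

The same check works for your real df: read ParsedComments.rds back in with readRDS() and compare it to df with identical().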